Data Visualization on Honey Production dataset (GLCA- DA- CHN- NOV '23) - Madhumitha.E

Week-4

PART II:

Data Visualization on Honey Production dataset using seaborn and matplotlib libraries.

Objective:

The goal is to use the Python visualization libraries seaborn and matplotlib to explore the data and draw useful conclusions.

Attribute Information:

Slno.  Attribute    Description

  1.   numcol       Number of honey-producing colonies.
  2.   yieldpercol  Honey yield per colony, in pounds.
  3.   totalprod    Total production (numcol x yieldpercol), in pounds.
  4.   priceperlb   Average price per pound based on expanded sales, in dollars.
  5.   prodvalue    Value of production (totalprod x priceperlb), in dollars.
  6.   stocks       Stocks held by producers, in pounds.
  7.   year         Calendar year.
  8.   state        State name (two-letter abbreviation in the data).
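The two derived attributes can be checked by hand against a few rows of the dataset preview (Out[2]). The sketch below is only illustrative: the `check_row` helper and its `tolerance` argument are hypothetical names, and the tolerance assumes the published dollar figures are rounded to the nearest thousand.

```python
# Rows copied from the notebook's Out[2] preview (year 1998).
# Fields: state, numcol, yieldpercol, totalprod, priceperlb, prodvalue
rows = [
    ("AL", 16000.0, 71, 1136000.0, 0.72, 818000.0),
    ("AZ", 55000.0, 60, 3300000.0, 0.64, 2112000.0),
    ("CA", 450000.0, 83, 37350000.0, 0.62, 23157000.0),
]


def check_row(row, tolerance=1000.0):
    """Return (totalprod_ok, prodvalue_ok) for one row.

    tolerance is an assumption: it allows for figures rounded to the
    nearest thousand in the published data.
    """
    _, numcol, yieldpercol, totalprod, priceperlb, prodvalue = row
    totalprod_ok = abs(numcol * yieldpercol - totalprod) <= tolerance
    prodvalue_ok = abs(totalprod * priceperlb - prodvalue) <= tolerance
    return totalprod_ok, prodvalue_ok


results = {row[0]: check_row(row) for row in rows}
print(results)
```

For AL, for instance, 16000 x 71 gives exactly 1,136,000 lb, while 1,136,000 x 0.72 gives 817,920 dollars, which the dataset reports as 818,000.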

1. Import required libraries and read the dataset

In [1]:
# importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sklearn as sk
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
In [2]:
# reading the dataset
df=pd.read_csv("honeyproduction (1).csv")
df
Out[2]:
state numcol yieldpercol totalprod stocks priceperlb prodvalue year
0 AL 16000.0 71 1136000.0 159000.0 0.72 818000.0 1998
1 AZ 55000.0 60 3300000.0 1485000.0 0.64 2112000.0 1998
2 AR 53000.0 65 3445000.0 1688000.0 0.59 2033000.0 1998
3 CA 450000.0 83 37350000.0 12326000.0 0.62 23157000.0 1998
4 CO 27000.0 72 1944000.0 1594000.0 0.70 1361000.0 1998
... ... ... ... ... ... ... ... ...
621 VA 4000.0 41 164000.0 23000.0 3.77 618000.0 2012
622 WA 62000.0 41 2542000.0 1017000.0 2.38 6050000.0 2012
623 WV 6000.0 48 288000.0 95000.0 2.91 838000.0 2012
624 WI 60000.0 69 4140000.0 1863000.0 2.05 8487000.0 2012
625 WY 50000.0 51 2550000.0 459000.0 1.87 4769000.0 2012

626 rows × 8 columns

2. Check the first few samples, shape, info of the data and try to familiarize yourself with different features.

In [3]:
# checking the shape of this dataset
df.shape
Out[3]:
(626, 8)
In [4]:
# checking the size of this dataset
df.size
Out[4]:
5008
In [5]:
# getting random samples
df.sample(5)
Out[5]:
state numcol yieldpercol totalprod stocks priceperlb prodvalue year
620 VT 4000.0 60 240000.0 53000.0 2.39 574000.0 2012
342 WY 40000.0 56 2240000.0 291000.0 0.89 1994000.0 2005
163 SD 235000.0 65 15275000.0 12220000.0 0.71 10845000.0 2001
118 PA 25000.0 45 1125000.0 630000.0 0.76 855000.0 2000
326 NM 7000.0 49 343000.0 113000.0 1.03 353000.0 2005
In [6]:
# checking the dtypes
df.dtypes
Out[6]:
state           object
numcol         float64
yieldpercol      int64
totalprod      float64
stocks         float64
priceperlb     float64
prodvalue      float64
year             int64
dtype: object
In [7]:
# Examining the information of the Honey production dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 626 entries, 0 to 625
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   state        626 non-null    object 
 1   numcol       626 non-null    float64
 2   yieldpercol  626 non-null    int64  
 3   totalprod    626 non-null    float64
 4   stocks       626 non-null    float64
 5   priceperlb   626 non-null    float64
 6   prodvalue    626 non-null    float64
 7   year         626 non-null    int64  
dtypes: float64(5), int64(2), object(1)
memory usage: 39.2+ KB
In [17]:
# summary statistics 
df.describe()
Out[17]:
numcol yieldpercol totalprod stocks priceperlb prodvalue year
count 626.000000 626.000000 6.260000e+02 6.260000e+02 626.000000 6.260000e+02 626.000000
mean 60284.345048 62.009585 4.169086e+06 1.318859e+06 1.409569 4.715741e+06 2004.864217
std 91077.087231 19.458754 6.883847e+06 2.272964e+06 0.638599 7.976110e+06 4.317306
min 2000.000000 19.000000 8.400000e+04 8.000000e+03 0.490000 1.620000e+05 1998.000000
25% 9000.000000 48.000000 4.750000e+05 1.430000e+05 0.932500 7.592500e+05 2001.000000
50% 26000.000000 60.000000 1.533000e+06 4.395000e+05 1.360000 1.841500e+06 2005.000000
75% 63750.000000 74.000000 4.175250e+06 1.489500e+06 1.680000 4.703250e+06 2009.000000
max 510000.000000 136.000000 4.641000e+07 1.380000e+07 4.150000 6.961500e+07 2012.000000
In [8]:
# counting the null values in each column
df.isnull().sum()
Out[8]:
state          0
numcol         0
yieldpercol    0
totalprod      0
stocks         0
priceperlb     0
prodvalue      0
year           0
dtype: int64
In [9]:
# counting the duplicate rows
df.duplicated().sum()
Out[9]:
0

Inference:

  1. The dataset consists of 626 entries with 8 columns representing different features.
  2. The 'state' column is of object type, indicating categorical data.
  3. The 'numcol', 'yieldpercol', 'totalprod', 'stocks', 'priceperlb', 'prodvalue' and 'year' columns are numeric (float64 or int64).
  4. There are no missing values in any of the columns, as the non-null counts for all columns are 626.
  5. The dataset contains information about honey-related metrics, including the number of colonies, yield per colony, total production, stocks, price per pound, production value, and the year of observation.
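The checks behind these inferences (shape, size, null counts, duplicate rows) can be gathered into one small helper. This is a minimal sketch; `quality_report` is a hypothetical name, and the three-row frame is a stand-in built from the preview rows so the block runs without the CSV file.

```python
import pandas as pd


def quality_report(frame):
    """Summarize size, missing values and duplicate rows of a DataFrame."""
    return {
        "rows": int(frame.shape[0]),
        "columns": int(frame.shape[1]),
        "cells": int(frame.size),
        "missing_values": int(frame.isnull().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


# Three rows from the Out[2] preview, used instead of the full file.
sample = pd.DataFrame({
    "state": ["AL", "AZ", "AR"],
    "numcol": [16000.0, 55000.0, 53000.0],
    "yieldpercol": [71, 60, 65],
    "year": [1998, 1998, 1998],
})

report = quality_report(sample)
print(report)
```

Calling `quality_report(df)` on the full dataset should agree with the cell outputs above: 626 rows, 8 columns, 5008 cells, no missing values and no duplicate rows.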

3. Display the percentage distribution of the data in each year using the pie chart.

In [10]:
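# count the records per year so that each slice is that year's share of the rows
year_counts = df['year'].value_counts().sort_index().rename_axis('year').reset_index(name='records')
pie_chart = px.pie(year_counts, values='records', names='year', title="Percentage Distribution of Records per Year (Pie Chart)")
# Show percentages and labels inside
pie_chart.update_traces(textposition='inside', textinfo='percent+label')
pie_chart.show()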

Inference:

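  1. Each slice reflects how many state-level records the dataset holds for that year, not the volume of honey produced. The shares are nearly equal, at roughly 6.4% to 7.0% per year.
  2. 2001, 2002 and 2003 have the largest shares, at about 7.0% each.
  3. 2009 and 2010 are among the smallest, at about 6.4% each. Since every record is one state in one year, a smaller slice means fewer states reported that year; on its own it says nothing about productivity.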
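The percentages in a pie chart of records per year rest on row counts. The sketch below uses a hypothetical toy list of years (the helper names are illustrative) to contrast count-based shares with shares computed from the sum of the year values, which is what a pie chart computes when the year column itself is passed as its `values`.

```python
from collections import Counter


def share_by_count(years):
    """Percentage of records that fall in each year."""
    counts = Counter(years)
    total = sum(counts.values())
    return {year: 100 * n / total for year, n in sorted(counts.items())}


def share_by_sum(years):
    """Percentage of the summed year values contributed by each year."""
    sums = Counter()
    for year in years:
        sums[year] += year
    total = sum(sums.values())
    return {year: 100 * s / total for year, s in sorted(sums.items())}


# Toy input: three records in 1998 and three in 2012.
years = [1998] * 3 + [2012] * 3
by_count = share_by_count(years)
by_sum = share_by_sum(years)
print(by_count)
print(by_sum)
```

With equal record counts the count-based shares are exactly equal, while the sum-based shares tilt slightly toward the later year. The distortion is small for years this close together, but it is not the quantity the exercise asks for.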

4. Plot and understand the distribution of the variable 'priceperlb' using displot, and write your findings.

In [11]:
sns.displot(df, x='priceperlb', kde=True)
plt.title('Distplot of Price per lb')
plt.show()

Inference:

  1. The distribution of honey prices per pound is right-skewed: most records sit at the lower prices, with a long tail of higher-priced honey, so the distribution is not normal.
  2. The histogram peaks at around $1.60, while the median is $1.36. Prices span a wide range, from $0.49 to $4.15 according to the summary statistics.
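Right skew can also be checked numerically: the mean sits above the median and the skewness coefficient is positive. This is a minimal sketch using the ten `priceperlb` values visible in the Out[2] preview; those rows cover only 1998 and 2012, so the numbers illustrate the method and are not the statistics of the full column. `sample_skewness` is a hypothetical helper.

```python
import statistics


def sample_skewness(values):
    """Fisher-Pearson skewness g1 = m3 / m2**1.5, using population moments."""
    mean = statistics.mean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# priceperlb values from the ten preview rows (five from 1998, five from 2012).
prices = [0.72, 0.64, 0.59, 0.62, 0.70, 3.77, 2.38, 2.91, 2.05, 1.87]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)
skew = sample_skewness(prices)
print(mean_price, median_price, skew)
```

On the full data, `df['priceperlb'].skew()` gives pandas' bias-corrected estimate of the same quantity; the summary statistics already show the mean (about 1.41) above the median (1.36), which is consistent with a right skew.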

5. Plot and understand the relationship between the variables 'numcol' and 'prodvalue' through scatterplot, and write your findings.

In [12]:
px.scatter(df, x='numcol', y='prodvalue', title="Scatter Plot of numcol vs. prodvalue", trendline="ols")

Inference:

  1. A positive correlation exists between the number of honey-producing colonies ('numcol') and the production value ('prodvalue').
  2. The relationship is not perfectly linear, suggesting the influence of other factors on production value.
  3. Higher 'numcol' values show a wider range of production values, indicating additional contributing factors.
  4. Outliers with low 'numcol' values and unexpectedly high production values emphasize the presence of influential factors beyond colony numbers.
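The `trendline="ols"` option fits an ordinary least squares line (plotly relies on statsmodels for it). The same quantities can be sketched with NumPy alone. The five points below are the 1998 preview rows, so the result is illustrative and not the fit over all 626 records.

```python
import numpy as np

# numcol and prodvalue for AL, AZ, AR, CA, CO (1998 rows of the preview).
numcol = np.array([16000.0, 55000.0, 53000.0, 450000.0, 27000.0])
prodvalue = np.array([818000.0, 2112000.0, 2033000.0, 23157000.0, 1361000.0])

# Pearson correlation coefficient between the two variables.
r = float(np.corrcoef(numcol, prodvalue)[0, 1])

# Least squares line prodvalue = slope * numcol + intercept.
slope, intercept = (float(c) for c in np.polyfit(numcol, prodvalue, deg=1))

print(f"r = {r:.3f}, slope = {slope:.2f} dollars per colony")
```

A single large producer such as CA dominates a fit over so few points, which is one reason to read the scatter plot alongside the coefficient.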

6. Plot and understand the relationship between categorical variable 'year' and a numerical variable 'prodvalue' through boxplot, and write your findings.

In [13]:
px.box(df, x='year', y='prodvalue',title="Boxplot of prodvalue vs year")

Inference:

  1. The distribution of 'prodvalue' differs visibly between years. This is a visual reading of the boxes; no statistical test is run here.
  2. 'prodvalue' shows an overall increasing trend over time, as seen in the rising median values across the years.
  3. Each year shows considerable variation in 'prodvalue', represented by wide boxes that indicate a large spread within the year.
  4. Potential outliers are present: some data points fall outside the whiskers, indicating extreme 'prodvalue' values in certain years.
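The medians and the points beyond the whiskers can also be computed directly. The sketch below assumes the common 1.5 x IQR whisker rule with pandas' default quantile interpolation, which may differ slightly from the quartile method plotly uses, and it runs on the ten preview rows only.

```python
import pandas as pd

# Ten rows from the Out[2] preview: five from 1998, five from 2012.
sample = pd.DataFrame({
    "state": ["AL", "AZ", "AR", "CA", "CO", "VA", "WA", "WV", "WI", "WY"],
    "year": [1998] * 5 + [2012] * 5,
    "prodvalue": [818000.0, 2112000.0, 2033000.0, 23157000.0, 1361000.0,
                  618000.0, 6050000.0, 838000.0, 8487000.0, 4769000.0],
})

grouped = sample.groupby("year")["prodvalue"]
medians = grouped.median()

# Per-year quartiles, broadcast back to every row of the group.
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# A point is flagged when it lies more than 1.5 * IQR outside the box.
is_outlier = (sample["prodvalue"] < q1 - 1.5 * iqr) | (sample["prodvalue"] > q3 + 1.5 * iqr)
outlier_states = sample.loc[is_outlier, "state"].tolist()

print(medians)
print(outlier_states)
```

In these rows only CA in 1998 is flagged, and the 2012 median is higher than the 1998 median.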

7. Visualize and understand the relationship between the multiple pairs of variables throughout different years using pairplot and add your inferences. (use columns 'numcol', 'yieldpercol', 'totalprod', 'prodvalue', 'year')

In [14]:
# pairplot is figure-level: size the panels with `height` (calling plt.figure() afterwards only creates an empty figure)
sns.pairplot(df[['numcol', 'yieldpercol', 'totalprod', 'prodvalue', 'year']], hue="year", diag_kind="kde", corner=True, height=2.5)
plt.show()

Inference:

  1. numcol vs. yieldpercol: A weak positive correlation suggests that, in general, higher numcol corresponds to higher yieldpercol, but with considerable scatter.
  2. numcol vs. totalprod: A strong positive correlation indicates that higher numcol leads to higher totalprod, as expected.
  3. numcol vs. prodvalue: A moderate positive correlation implies that higher numcol tends to result in higher prodvalue, but other factors play a role.
  4. yieldpercol vs. totalprod: A strong positive correlation implies a direct relationship, as totalprod is the product of numcol and yieldpercol.
  5. yieldpercol vs. prodvalue: A moderate positive correlation suggests that higher yieldpercol tends to lead to higher prodvalue, but other factors contribute.
  6. totalprod vs. prodvalue: A strong positive correlation indicates that higher totalprod corresponds to higher prodvalue, as prodvalue is calculated from totalprod.

8. Display the correlation values using a plot and add your inferences. (use columns 'numcol', 'yieldpercol', 'totalprod', 'stocks', 'priceperlb', 'prodvalue')

In [15]:
# kind="reg" adds a regression line to each panel; size the panels with `height` instead of plt.figure()
sns.pairplot(df[['numcol', 'yieldpercol', 'totalprod', 'stocks', 'priceperlb', 'prodvalue']], diag_kind="kde", kind="reg", corner=True, height=2)
plt.show()
In [16]:
columns = ['numcol', 'yieldpercol', 'totalprod', 'stocks', 'priceperlb', 'prodvalue']
# Calculate the correlation matrix
correlation_matrix = df[columns].corr()
# Create a heatmap
plt.figure(figsize=(5, 5))
sns.heatmap(correlation_matrix,annot=True)
plt.title('Correlation Plot of Selected Columns')
plt.show()

Inferences:

Strong positive correlations:

  1. numcol and totalprod: As expected, an increase in the number of colonies results in higher total production.
  2. totalprod and prodvalue: Consistent with the calculation, higher total production corresponds to higher production value.

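Correlations involving stocks:

  1. numcol and stocks: The correlation is positive, not negative. States with more colonies also hold more honey in stock; in the preview rows, CA (450,000 colonies) holds 12,326,000 lb while VA (4,000 colonies) holds 23,000 lb.
  2. stocks and prodvalue: Also positive, since both grow with the size of a state's production. Higher stocks therefore do not indicate lower production value.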

Weak correlations:

  1. numcol and yieldpercol: A weak positive correlation suggests a limited impact of colony number on honey yield per colony.
  2. yieldpercol and prodvalue: A weak positive correlation implies that honey yield per colony has a minor influence on production value, possibly due to the role of price per pound.
  3. priceperlb and most other variables: Weak correlations suggest that the price per pound moves largely independently of the other variables.
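Reading pairs off a heatmap by eye is error-prone, so it can help to rank the pairs programmatically. This is a minimal sketch: `ranked_pairs` is a hypothetical helper, and the ten rows come from the Out[2] preview, so the values illustrate the method and are not the full-data correlations.

```python
import numpy as np
import pandas as pd


def ranked_pairs(corr):
    """List each variable pair once, sorted by absolute correlation."""
    # Keep only the cells above the diagonal so every pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order)


# Ten preview rows (AL, AZ, AR, CA, CO in 1998; VA, WA, WV, WI, WY in 2012).
sample = pd.DataFrame({
    "numcol": [16000.0, 55000.0, 53000.0, 450000.0, 27000.0,
               4000.0, 62000.0, 6000.0, 60000.0, 50000.0],
    "stocks": [159000.0, 1485000.0, 1688000.0, 12326000.0, 1594000.0,
               23000.0, 1017000.0, 95000.0, 1863000.0, 459000.0],
    "priceperlb": [0.72, 0.64, 0.59, 0.62, 0.70, 3.77, 2.38, 2.91, 2.05, 1.87],
})

pairs = ranked_pairs(sample.corr())
print(pairs)
```

Passing `df[columns].corr()` to the same helper gives the ranking for the full dataset. In these preview rows the numcol and stocks pair comes out strongly positive.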